Skip to content

Feat/rich oteL lmetrics#975

Merged
Manuthor merged 17 commits into
developfrom
feat/richOTELmetrics
Jun 17, 2026
Merged

Feat/rich oteL lmetrics#975
Manuthor merged 17 commits into
developfrom
feat/richOTELmetrics

Conversation

@HatemMn

@HatemMn HatemMn commented May 29, 2026

Copy link
Copy Markdown
Contributor

Overview

read the issue

How to read this PR

The 8 metrics that are asked to be integrated operate (almost) independently and each one of them is coded somewhere in the codebase.

In this PR, we will try to work on half of them :

Canonical Global Plan

1. Database metrics

commit: read below

  1. Database metrics wiring.
    Goal: wire database operations total and database operation duration at database entry points.
    Output: every database call path emits count and duration with normalized operation naming.
    Done when: database metrics are non-zero during normal CRUD and permission flows, with no double counting.

2. Http metrics

commit: read below

HTTP metrics wiring.
Goal: wire HTTP requests total, HTTP request duration, and active connections via middleware.
Output: one middleware layer that increments active connections on request start and decrements on completion, including error paths.
Done when: method/path/status request metrics appear and active connections returns to baseline after load bursts.

3. Object metrics

commit: read below

Objects total metric wiring.
Goal: wire objects total in the periodic metrics refresh job.
We want to wire both active_keys_count_value as well as kms.objects.total. A file on the matter alread exists : crate/server/src/cron.rs. It's wrong for multi admin deployments and shall be deleted

3.bis : fix the active_keys_count by changing/deleting that cron job and reusing the code of the previous step

4. HSM and Cache metrics

commit: read below

Cache and HSM metrics wiring.
Goal: wire cache operations total and hsm operations total.
Output:

  • cache metric emitted on hit/miss/insert or equivalent result taxonomy;
  • HSM metric emitted on oracle-backed crypto operations with reliable operation naming.

Human readable summarry of the technical work done, for revierwers :

all the text below was AI generated but it's actually relevant

e2b28745 — DB Metrics

Metrics: kms.db.operations.total, kms.db.operation.duration

What was done:

A DbMetricsRecorder trait was introduced in server_database as the only bridge between the database layer and the OTEL layer — keeping the dependency arrow strictly one-directional (server → server_database, never the reverse). OtelMetrics in the server crate implements that trait.

The Database facade gained a private generic record() helper method that wraps any async DB call: it starts a timer, awaits the future, determines "success" or "error" from the result, then calls the injected recorder with the operation name, backend kind ("sqlite", "postgresql", "mysql", "redis"), outcome, and elapsed wall-clock time. Every public method on Databasecreate, retrieve_object, retrieve_objects, retrieve_tags, update_object, delete, find, atomic, is_object_owned_by, grant_ops, remove_ops, etc. — was wrapped with this helper.

Both the counter (kms.db.operations.total) and the histogram (kms.db.operation.duration) carry the same three labels: operation, backend, outcome.




bdc30123 — HTTP Metrics Middleware

Metrics: kms.http.requests.total, kms.http.request.duration, kms.active.connections

What was done:

A new Actix-web middleware OtelHttpMetrics was created in otel_http_middleware.rs. It wraps every inbound request as the outermost App-level layer so it captures total latency including auth and routing.

On each request it:

  1. Increments kms.active.connections (an UpDownCounter) immediately before calling the inner service.
  2. Starts a std::time::Instant timer.
  3. Awaits the inner service — covers both success and error paths.
  4. Decrements kms.active.connections unconditionally in the completion path.
  5. Records kms.http.requests.total (counter) and kms.http.request.duration (histogram) with labels method, path, status.

A normalize_path() function maps the raw URI to a low-cardinality label set (e.g. all KMIP calls become /kmip/2_1, UI assets become /ui/...) to prevent metric cardinality explosion from hashed filenames or per-UID REST paths.

When OTLP is not configured, metrics is None and the entire middleware is a zero-overhead pass-through.


276c3b97 — Objects Total (Step 3, first pass)

Metrics: kms.objects.total, kms.keys.active.count

What was done:

Two new gauges were added to OtelMetrics: kms_objects_total (kms.objects.total) and active_keys_count (kms.keys.active.count). Both are plain i64_gauge instruments that record an absolute snapshot — no delta arithmetic.

⚠️ human note : I didn't quite get 'why' the variable was coded as delta, this was just a source of complicated sub-optimal code. If this should be reverted, please explain why

The ObjectsStore trait gained a count_all_non_destroyed() default method. At this point the default just returned 0 with a warn!() log so it was safe to deploy without breaking backends that hadn't implemented it yet.

The existing cron.rs metrics loop was extended to call kms.database.count_all_non_destroyed_objects() every 30 seconds and feed the result into metrics.update_objects_total(count).

This commit still carried a TODO note on the cron active-keys path because the old Locate-based approach was known to undercount in multi-admin deployments; that was addressed in step 3-bis.


16043a5f — Redis-Findex Backend for Object Count (Step 3-bis, part 1)

Metrics: kms.objects.total, kms.keys.active.count

What was done:

The Redis-findex backend (redis) received a real implementation of count_all_non_destroyed. Because Redis has no native COUNT WHERE state != Destroyed query, a dedicated O(1) counter key was introduced in the ObjectsDb layer:

  • adjust_live_count(delta: i64) — increments or decrements the Redis counter atomically alongside every create/delete/destroy operation.
  • get_live_count() / set_live_count() — read and overwrite the counter for reconcile/repair.
  • scan_count_non_destroyed() — full scan fallback that iterates all keys and counts those not in Destroyed state, used to seed or correct the fast-path counter.

The fast-path counter is kept consistent in the hot path; a periodic full-scan reconcile (every 5 minutes, added in the follow-up commit) corrects any drift from partial failures.

Integration tests were added in additional_redis_findex_tests.rs to exercise create/destroy sequences and verify the count stays accurate.


e52832fa — Step 3-bis: Full Objects Count + Old Cron Cleanup

Metrics: kms.objects.total, kms.keys.active.count

What was done:

This is the commit that closed the multi-admin correctness problem noted in step 3.

The old cron.rs logic that used a KMIP Locate call filtered to State::Active was deleted. That approach had an inherent ACL bias: if the cron user didn't own all keys, the count would undercount. It was replaced with a privileged backend call — kms.database.count_non_destroyed_key_objects() — which queries the raw store directly, bypassing KMIP access control.

A new 5-minute reconcile_interval ticker was added to cron.rs. It calls kms.database.reconcile_all_object_counts(), which is a no-op for SQL backends (they compute the count on-the-fly from a real SQL COUNT) but repairs the Redis O(1) counter by running scan_count_non_destroyed() and overwriting the fast-path key if it has drifted.

SQL backends (SQLite, PostgreSQL, MySQL) received real count_all_non_destroyed implementations using a SELECT COUNT(*) WHERE state != 'Destroyed' query added to query.sql and query_mysql.sql. The HsmStore implementation was also wired, delegating to count_non_destroyed_keys() (see next commit).


8ebecb41 — Step 3-bis: HSM Object Count + Tests

Metrics: kms.objects.total (via HSM object store path)

What was done:

The HsmStore's count_all_non_destroyed() override changed from returning a hard-coded 0 (with a suppressed warning) to calling self.count_non_destroyed_keys(). The reasoning: on an HSM, deleting a key physically removes it from the device — there is no Destroyed state — so every key present in a slot is by definition non-destroyed. Therefore delegating to the existing key-count path is semantically correct.

A full mockall-generated MockHsm test double was added to hsm_store.rs tests, and a new test verified that count_all_non_destroyed returns the same value as count_non_destroyed_keys for a mock HSM with two slots (3 + 2 keys = 5 total). This ensures the method is never accidentally hard-coded to 0 again.


e7eb0ffc — Cache & HSM Operation Metrics (Step 4)

Metrics: kms.cache.operations.total, kms.hsm.operations.total

What was done:

Cache side: Three recording points were added in other_kms_methods.rs inside get_unwrapped_object — the single code path that owns all cache interactions. A hit calls record_cache_operation("get", "hit"), a miss calls record_cache_operation("get", "miss"), and a successful insert calls record_cache_operation("insert", "ok"). Labels are operation and result.

HSM side: Recording was placed at the real crypto-oracle dispatch point in perform_crypto_operation in crypto_op.rs. When the resolved key routes to a ResolvedKey::Oracle branch (meaning the actual cryptographic operation runs on an HSM via PKCS#11), the code now calls metrics.record_hsm_operation(Op::OP_NAME, model) after the oracle returns. The model string ("softhsm2", "utimaco", etc.) is resolved by a new utility function hsm_model_from_prefix in uid_utils.rs, which looks up the configured HsmInstanceParams by UID prefix and returns the model field — or falls back to the raw prefix if no match is found. Wrap and Unwrap operations in wrap.rs / unwrap.rs were also wired individually with their own record_hsm_operation calls since they go through a separate code path. Labels are operation and model.

Closes #997

@HatemMn HatemMn linked an issue Jun 4, 2026 that may be closed by this pull request
9 tasks
@HatemMn HatemMn force-pushed the feat/richOTELmetrics branch from a49af1c to e7eb0ff Compare June 8, 2026 12:56
@HatemMn HatemMn changed the title Feat/rich ote lmetrics Feat/rich oteL lmetrics Jun 8, 2026
@HatemMn HatemMn marked this pull request as draft June 8, 2026 13:33
@HatemMn HatemMn self-assigned this Jun 8, 2026
@HatemMn HatemMn force-pushed the feat/richOTELmetrics branch from a7cbffc to 7561ac2 Compare June 9, 2026 09:32
@HatemMn HatemMn marked this pull request as ready for review June 9, 2026 16:38
@HatemMn HatemMn linked an issue Jun 10, 2026 that may be closed by this pull request
6 tasks
@HatemMn HatemMn force-pushed the feat/richOTELmetrics branch from 532e289 to 1e482d4 Compare June 11, 2026 19:10
@HatemMn HatemMn requested a review from Copilot June 15, 2026 09:04
@HatemMn HatemMn force-pushed the feat/richOTELmetrics branch from 9ff46bb to c0069d1 Compare June 15, 2026 09:07

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR wires previously-defined-but-dead OpenTelemetry metrics across the KMS server, spanning database, HTTP, object-count, cache, and HSM paths, and updates docs/tests to validate OTLP export behavior.

Changes:

  • Add DB operation count/duration instrumentation via an injected DbMetricsRecorder trait, implemented by OtelMetrics.
  • Add Actix middleware for HTTP request counters/duration and active connection tracking, plus path normalization to cap cardinality.
  • Implement privileged object/key counting across backends (SQL/Redis/HSM), add Redis O(1) counters + reconcile, update docs/changelog, and extend OTLP export tests.

Reviewed changes

Copilot reviewed 45 out of 46 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
nix/expected-hashes/ui.vendor.non-fips.sha256 Update pinned UI vendor hash (non-FIPS) for Nix builds.
nix/expected-hashes/ui.vendor.fips.sha256 Update pinned UI vendor hash (FIPS) for Nix builds.
nix/expected-hashes/server.vendor.static.sha256 Update pinned server vendor/static hash for Nix builds.
nix/expected-hashes/cli.vendor.linux.sha256 Update pinned CLI vendor hash for Nix builds.
monitoring/OTLP_METRICS.md Reformat/extend metrics documentation and add resource links.
documentation/mkdocs.yml Add nav entry for new “Metrics reference” page under Monitoring.
documentation/docs/configuration/otlp-metrics.md New metrics reference page detailing instruments, labels, and semantics.
documentation/docs/configuration/logging.md Link logging/telemetry doc to the new metrics reference.
crate/server/src/start_kms_server.rs Install OTEL HTTP metrics middleware at the App level.
crate/server/src/middlewares/otel_http_middleware.rs New Actix middleware for HTTP metrics + path normalization + tests.
crate/server/src/middlewares/mod.rs Export OtelHttpMetrics middleware from module.
crate/server/src/cron.rs Replace Locate-based key counting with privileged DB counts; add reconcile ticker.
crate/server/src/core/wrapping/wrap.rs Record HSM operation metric for wrap path using crypto oracle.
crate/server/src/core/wrapping/unwrap.rs Record HSM operation metric for unwrap path using crypto oracle.
crate/server/src/core/uid_utils.rs Add hsm_model_from_prefix helper + tests; re-export HSM instance params usage.
crate/server/src/core/otel_metrics.rs Implement gauges for object/key counts, add DB recorder impl, add more metric tests, cardinality cap behavior.
crate/server/src/core/operations/key_ops/crypto_op.rs Record HSM operation metric on oracle-routed crypto operations.
crate/server/src/core/kms/other_kms_methods.rs Record cache hit/miss/insert metrics in unwrap-cache path.
crate/server/src/core/kms/mod.rs Inject DB metrics recorder into Database; seed object/key gauges at startup; update otel reader/resource setup for otel 0.29.
crate/server/src/config/params/mod.rs Re-export HsmInstanceParams.
crate/server/src/config/mod.rs Re-export HsmInstanceParams from config module.
crate/server/Cargo.toml Add opentelemetry_sdk testing feature for metric tests.
crate/server_database/src/tests/mod.rs Register additional Redis-findex tests for new counters.
crate/server_database/src/stores/sql/sqlite.rs Implement privileged count queries + add loader and E2E tests for count query presence/correctness.
crate/server_database/src/stores/sql/query.sql Add named count queries for non-destroyed objects and keys (SQLite/PG variants).
crate/server_database/src/stores/sql/query_mysql.sql Add named count queries for non-destroyed objects and keys (MySQL).
crate/server_database/src/stores/sql/pgsql.rs Implement privileged count queries for objects and keys (PostgreSQL).
crate/server_database/src/stores/sql/mysql.rs Implement privileged count queries for objects and keys (MySQL).
crate/server_database/src/stores/redis/redis_with_findex.rs Maintain Redis O(1) live-object + key counters across CRUD/state/atomic; implement count + reconcile paths.
crate/server_database/src/stores/redis/objects_db.rs Add Redis counter keys + adjust/get/set + SCAN bootstrap counting helpers.
crate/server_database/src/stores/redis/additional_redis_findex_tests.rs Add integration tests for Redis live-object and active-key counters + bootstrap behavior.
crate/server_database/src/lib.rs Re-export DbMetricsRecorder.
crate/server_database/src/core/unwrapped_cache.rs Update tests to pass new Database::instantiate signature (recorder arg).
crate/server_database/src/core/mod.rs Add recorder field to Database; add MainDbKind::as_str; thread recorder through instantiation.
crate/server_database/src/core/db_metrics.rs New trait defining the DB→server metrics recording interface.
crate/server_database/src/core/database_permissions.rs Wrap permission DB facade methods with record(...).
crate/server_database/src/core/database_objects.rs Add central record(...) wrapper; wrap object-store facade methods; add privileged count/reconcile facade APIs + recorder test.
crate/interfaces/src/stores/objects_store.rs Extend ObjectsStore with default count/reconcile methods (metrics-only privileged APIs).
crate/interfaces/src/hsm/hsm_store.rs Implement HSM count methods; add mockall-based unit test ensuring count delegation/behavior.
crate/interfaces/Cargo.toml Add dev-deps for mock-based async tests.
CHANGELOG/feat_richOTELmetrics.md Branch changelog documenting metric wiring steps and rationale.
Cargo.toml Bump OpenTelemetry crates to 0.29 workspace-wide.
Cargo.lock Lockfile updates for OTel bump + new dev-dependencies (mockall, etc.).
.github/scripts/test/test_otel_export.sh Extend OTEL export integration script to assert new metric families are non-zero and labels present; add wrapped-key scenario for cache metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crate/server_database/src/core/database_objects.rs
Comment thread crate/server/src/cron.rs
Comment thread monitoring/OTLP_METRICS.md Outdated
Comment thread documentation/docs/configuration/otlp-metrics.md Outdated
Comment thread crate/server/src/middlewares/otel_http_middleware.rs
@HatemMn HatemMn force-pushed the feat/richOTELmetrics branch from 7bd470d to 58e3f27 Compare June 16, 2026 09:28
Comment thread .github/reusable_scripts
Comment thread monitoring/OTLP_METRICS.md
Comment thread test_data
@HatemMn HatemMn force-pushed the feat/richOTELmetrics branch from 58e3f27 to 0dc7468 Compare June 16, 2026 13:39
Comment thread .github/scripts/test/test_otel_export.sh
Comment thread test_data
Comment thread monitoring/OTLP_METRICS.md
Comment thread test_data
@Manuthor Manuthor enabled auto-merge (squash) June 16, 2026 16:43
@Manuthor Manuthor merged commit 363b954 into develop Jun 17, 2026
71 of 72 checks passed
@Manuthor Manuthor deleted the feat/richOTELmetrics branch June 17, 2026 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

wire 8 dead metrics: db/http/connections/objects/cache/hsm

3 participants